In response to a severe lack of reporting within government sources, The Washington Post compiled a database of every fatal police shooting in the United States from 2015 to 2022. We are interested in exploring this data, specifically as it relates to differences between U.S. states and regions.
This exploratory data analysis is divided into five main parts: first, we organize the data; second, we perform some basic statistical analyses; third, we reshape the data for state- and region-based comparative analyses; fourth, we ask a SMART research question about our data and attempt to answer it; fifth, we use the result of that research question to pose a second, modeling-focused SMART question.
In Part 3 the data is reshaped and new data is added. This is for both the first part of this project (Midterm) and the latter part (Final).
To look at the modeling part of this project, please move down to line 1053, where Part 5 starts.
If you would like to only run things with
First we call our packages. Then we read the data set that comes from a csv file called FPS22.csv.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ purrr 0.3.5
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ plotly::filter() masks dplyr::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
After accounting for null values, the data set we are working with has 6,574 observations. Below we have provided a single sample observation:
| Name | Date | Manner of Death | Armed | Age | Gender | Race | City |
|---|---|---|---|---|---|---|---|
| Tim Elliot | 10/04/2022 | Shot | Gun | 53 | M | A | Shelton |
| State | Signs of Mental Illness | Threat Level | Flee | Body Camera | Longitude | Latitude | Is Geocoding Exact? |
|---|---|---|---|---|---|---|---|
| WA | 1 | TRUE | Not fleeing | FALSE | -123 | 47.2 | TRUE |
The total number of observations:
## [1] 5720
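The loading and null-handling step described above can be sketched in base R as follows. This is a minimal illustration, not the code used in this report: the miniature CSV, its column names, and the choice of `na.omit()` are all assumptions standing in for FPS22.csv.

```r
# Hypothetical miniature stand-in for FPS22.csv; the column names are assumptions.
csv_text <- "name,age,state,body_camera
P1,53,WA,FALSE
P2,,OR,TRUE
P3,34,,FALSE
P4,29,CA,TRUE"

tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

# Treat empty strings as missing so that na.omit() can drop those rows.
raw <- read.csv(tmp, na.strings = c("", "NA"), stringsAsFactors = FALSE)

# Keep only the rows with no missing values.
clean <- na.omit(raw)

nrow(raw)    # rows before removing incomplete observations
nrow(clean)  # rows after removing incomplete observations
```

Dropping every incomplete row is the simplest option; if only a few columns matter for a given analysis, filtering on those columns alone would keep more observations.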
We provide some basic statistics about 2015-2022 fatal police shootings in the United States, using information from the Washington Post data set.
Mean age of victims of police violence:
## [1] 36.7
Median age of victims of police violence:
## [1] 34
Frequency graph for the age of victims of police violence:
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
Frequency graph for the race of victims of police violence:
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
Frequency graph for the gender of victims of police violence:
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
Frequency graph for the manner of death of victims of police violence:
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
Frequency graph for the threat level of victims of police violence:
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
Hover over the map below to see the breakdown of fatal police shootings, divided by the race of the victim. We looked at the total number of deaths in each state by race, and the following are some of the insights:
We see that the state with the largest number of victims of police violence is California with a total of 885 victims, followed by Texas with a total of 553 and then Florida with 427.
These results are consistent with the populations of these states: California is the most populous, followed by Texas and then Florida.
We also observe that in California the highest number of deaths is among Hispanic people, whereas in Texas and Florida there are more fatal shootings of White people.
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
Now we look at the age of the suspect shot, as well as their race. We made the following observations:
We see from the boxplot below that the median age for Black people that have been killed by police is 29 years.
White people have a relatively higher median age of 35 years, whereas Asian people have the highest median age of around 38 years.
If we look at the age of each victim against the status of their mental health, we can make the following observation: signs of mental illness appear most frequently among victims in their 30s, while among victims aged 50 and above, deaths are more common for those showing signs of mental illness.
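The group medians read off the boxplot can also be computed directly. The sketch below uses a small made-up data frame (the `race` and `age` columns and their values are assumptions chosen only for illustration) and base R's `tapply()`.

```r
# Hypothetical toy records; the real data set has one row per shooting.
victims <- data.frame(
  race = c("B", "B", "B", "W", "W", "W", "A", "A"),
  age  = c(25, 29, 40, 30, 35, 50, 36, 40)
)

# Median age within each race group (the centre line of each box in a boxplot).
median_age <- tapply(victims$age, victims$race, median)
median_age
```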
After pursuing the above exploratory analysis, we decided to do some comparative analyses between states and regions to create a specific, measurable, achievable, relevant, and time-oriented research question to pursue for the remainder of the project.
To do this, we began by dividing the data into regions for easier visualization and comparative analysis. The regions divide the US states as follows:
| Northwest (NW) | Southwest (SW) | Midwest (MW) | Southeast (SE) | Northeast (NE) |
|---|---|---|---|---|
| California | New Mexico | Illinois | Georgia | New York |
| Washington | Arizona | Wisconsin | Alabama | Rhode Island |
| Oregon | Texas | Indiana | Mississippi | Maryland |
| Nevada | Oklahoma | Michigan | Louisiana | Vermont |
| Idaho | Hawaii | Minnesota | Tennessee | Pennsylvania |
| Utah | - | Missouri | North Carolina | Maine |
| Montana | - | Iowa | South Carolina | New Hampshire |
| Colorado | - | Kansas | Florida | New Jersey |
| Wyoming | - | North Dakota | Arkansas | Connecticut |
| Arkansas | - | South Dakota | West Virginia | Massachusetts |
| Arkansas | - | Nebraska | DC | - |
| - | - | Ohio | Virginia | - |
Fatal shootings in the Northwest United States:
## [1] 1551
Fatal shootings in the Southwest United States:
## [1] 1058
Fatal shootings in the Midwest United States:
## [1] 955
Fatal shootings in the Southeast United States:
## [1] 1668
Fatal shootings in the Northeast United States:
## [1] 488
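One way to attach a region to each observation and count shootings per region is a named lookup vector. The sketch below is an assumption about how this could be done, not the report's own code; the lookup covers only a handful of states from the table above, and the `state` vector is invented.

```r
# Partial lookup from state abbreviation to region, following the table above.
# Only a few states are listed here; a full version would cover every state.
region_of <- c(
  CA = "NW", WA = "NW", OR = "NW",
  TX = "SW", AZ = "SW",
  IL = "MW", OH = "MW",
  FL = "SE", GA = "SE",
  NY = "NE", PA = "NE"
)

# Hypothetical state column with one entry per shooting.
state <- c("CA", "CA", "TX", "FL", "NY", "OH", "WA", "GA", "CA")

# Map each observation to its region and store the result as a factor.
regions <- factor(unname(region_of[state]),
                  levels = c("MW", "NE", "NW", "SE", "SW"))

# Number of shootings per region.
region_counts <- table(regions)
region_counts
```

A state missing from the lookup would map to `NA`, so checking `anyNA(regions)` after the mapping is a cheap way to catch an incomplete or mistyped table.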
We also create some new columns for the modeling portion (Part 5).
The first new block of data we add is state-level police spending per capita for the year 2021.
Then we add a new binary variable indicating whether a state has a law that mandates police officers to use their body cameras when interacting with members of the public (for Part 5).
Next, we add data on the direction each state swung in the 2020 election (negative = swing to Trump, positive = swing to Biden) (for Part 5).
Finally, we add a variable for the number of police officers per 100K citizens by state (for Part 5).
Basic EDA of these new variables appears in Part 5.
We then created two sub-data sets by grouping the data by state and by region for visualization purposes. The contents of both groups are identical, besides their grouping.
Within our data set of 6,574 observations of police shootings from 2015 to 2022 in the United States, is there a correlation between the U.S. state of observation and whether a body camera was turned on during the shooting?
First let’s take a look at our data after it has been grouped by state and reorganized into the following variables:
| Variable | Meaning |
|---|---|
| state | State of observation |
| region | Region of observation |
| stbcp | Body camera on proportion by state |
| gen.p | Proportion of male victims by state |
| smi.p | Proportion of victims by state with signs of mental illness |
| flee.p | Proportion of victims by state that were fleeing |
| att.p | Proportion of victims by state that were attacking |
| armed.p | Proportion of victims by state that were armed |
| MoD.p | Proportion of victims by state that were shot |
| age.avg | Average age by state |
| Non_White_prop | Proportion of non-White victims by state |
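The state-level variables in this table are proportions and averages computed within each state. A minimal base R sketch of that idea is shown below; the toy data frame and the use of `tapply()` and `merge()` are assumptions for illustration, and only `stbcp` and `age.avg` are reproduced.

```r
# Hypothetical toy data: one row per shooting.
shootings <- data.frame(
  state       = c("WA", "WA", "WA", "WA", "OR", "OR", "CA", "CA", "CA", "CA"),
  body_camera = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE),
  age         = c(53, 30, 41, 36, 28, 44, 33, 39, 25, 47)
)

# The mean of a logical vector is the proportion of TRUE values.
stbcp   <- tapply(shootings$body_camera, shootings$state, mean)
age.avg <- tapply(shootings$age, shootings$state, mean)

# One row per state.
by_state <- data.frame(
  state   = names(stbcp),
  stbcp   = as.vector(stbcp),
  age.avg = as.vector(age.avg)
)

# Attach the state-level values back onto every observation.
with_state_vars <- merge(shootings, by_state, by = "state")
by_state
```

Merging the state-level values back onto the individual rows keeps one row per shooting, so every observation from the same state carries identical values for these columns.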
The state data subgroup can be summarized as follows:
## state month year regions
## Length:5720 Length:5720 Length:5720 MW: 955
## Class :character Class :character Class :character NE: 488
## Mode :character Mode :character Mode :character NW:1551
## SE:1668
## SW:1058
##
## spendpc bclaw marg2020 le_per_100k stbcp
## Min. : 390 Min. :0.000 Min. :-43.0 Min. :284 Min. :0.000
## 1st Qu.: 526 1st Qu.:0.000 1st Qu.:-12.0 1st Qu.:379 1st Qu.:0.106
## Median : 608 Median :0.000 Median : 0.2 Median :439 Median :0.132
## Mean : 650 Mean :0.017 Mean : 1.5 Mean :441 Mean :0.143
## 3rd Qu.: 704 3rd Qu.:0.000 3rd Qu.: 16.0 3rd Qu.:479 3rd Qu.:0.180
## Max. :1337 Max. :1.000 Max. : 87.0 Max. :722 Max. :0.429
## gen.p smi.p flee.p att.p armed.p
## Min. :0.800 Min. :0.000 Min. :0 Min. :0.400 Min. :0.818
## 1st Qu.:0.937 1st Qu.:0.197 1st Qu.:0 1st Qu.:0.577 1st Qu.:0.912
## Median :0.946 Median :0.232 Median :0 Median :0.643 Median :0.929
## Mean :0.951 Mean :0.230 Mean :0 Mean :0.639 Mean :0.931
## 3rd Qu.:0.965 3rd Qu.:0.268 3rd Qu.:0 3rd Qu.:0.684 3rd Qu.:0.955
## Max. :1.000 Max. :0.714 Max. :0 Max. :1.000 Max. :1.000
## MoD.p age.avg Non_White_prop
## Min. :0.800 Min. :31.7 Min. :0.000
## 1st Qu.:0.934 1st Qu.:35.4 1st Qu.:0.370
## Median :0.944 Median :36.6 Median :0.509
## Mean :0.949 Mean :36.7 Mean :0.490
## 3rd Qu.:0.968 3rd Qu.:38.2 3rd Qu.:0.587
## Max. :1.000 Max. :47.0 Max. :0.931
The region data subgroup can be summarized as follows:
## state month year spendpc
## Length:5720 Length:5720 Length:5720 Min. : 390
## Class :character Class :character Class :character 1st Qu.: 526
## Mode :character Mode :character Mode :character Median : 608
## Mean : 650
## 3rd Qu.: 704
## Max. :1337
## bclaw marg2020 le_per_100k stbcp gen.p
## Min. :0.000 Min. :-43.0 Min. :284 Min. :0.000 Min. :0.800
## 1st Qu.:0.000 1st Qu.:-12.0 1st Qu.:379 1st Qu.:0.106 1st Qu.:0.937
## Median :0.000 Median : 0.2 Median :439 Median :0.132 Median :0.946
## Mean :0.017 Mean : 1.5 Mean :441 Mean :0.143 Mean :0.951
## 3rd Qu.:0.000 3rd Qu.: 16.0 3rd Qu.:479 3rd Qu.:0.180 3rd Qu.:0.965
## Max. :1.000 Max. : 87.0 Max. :722 Max. :0.429 Max. :1.000
## smi.p flee.p att.p armed.p MoD.p
## Min. :0.000 Min. :0 Min. :0.400 Min. :0.818 Min. :0.800
## 1st Qu.:0.197 1st Qu.:0 1st Qu.:0.577 1st Qu.:0.912 1st Qu.:0.934
## Median :0.232 Median :0 Median :0.643 Median :0.929 Median :0.944
## Mean :0.230 Mean :0 Mean :0.639 Mean :0.931 Mean :0.949
## 3rd Qu.:0.268 3rd Qu.:0 3rd Qu.:0.684 3rd Qu.:0.955 3rd Qu.:0.968
## Max. :0.714 Max. :0 Max. :1.000 Max. :1.000 Max. :1.000
## age.avg Non_White_prop
## Min. :31.7 Min. :0.000
## 1st Qu.:35.4 1st Qu.:0.370
## Median :36.6 Median :0.509
## Mean :36.7 Mean :0.490
## 3rd Qu.:38.2 3rd Qu.:0.587
## Max. :47.0 Max. :0.931
We will now check our data for normality:
Because the plot is relatively linear, we can conclude this data is close enough to normality for our purpose.
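A normality check of this kind is commonly done with a normal Q-Q plot. The sketch below is an assumption about the approach, run on simulated values standing in for `stbcp`; it also adds a Shapiro-Wilk test as a numeric complement to the visual check, which the analysis above does not report.

```r
set.seed(1)

# Hypothetical state-level proportions standing in for stbcp.
stbcp <- rnorm(51, mean = 0.14, sd = 0.06)

# Open a null PDF device so the sketch also runs in non-interactive sessions.
pdf(NULL)
qq <- qqnorm(stbcp, main = "Normal Q-Q plot of stbcp")
qqline(stbcp)  # reference line through the first and third quartiles
invisible(dev.off())

# Correlation between theoretical and sample quantiles:
# values close to 1 correspond to a nearly straight Q-Q plot.
qq_cor <- cor(qq$x, qq$y)

# Shapiro-Wilk test: a small p-value would be evidence against normality.
sw <- shapiro.test(stbcp)

qq_cor
sw$p.value
```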
Now let us look at the body camera proportions by state. In the below bar graph, TRUE signifies a police body camera that was on, while FALSE indicates the body camera was off:
Number of fatal shootings where the body camera was on:
## body_camera n
## 1 TRUE 905
Number of fatal shootings where the body camera was off:
## body_camera n
## 1 FALSE 5383
This scatter plot shows the proportion of fatal shootings in which cameras were on, by state (the variable stbcp). Each point on the graph depicts a state’s proportion of shootings where the police body camera was turned on during the incident. We can see that there is very little variation among states in the Southwest and considerable variation among states in the Midwest.
Finally, let us check the mean body-camera-on proportion (stbcp) across all states:
## [1] 0.143
And the median body-camera-on proportion (stbcp) across all states:
## [1] 0.132
We will now perform a chi-square test to see whether the body-camera-on proportions differ significantly between states.
Null: There is no significant difference between US states in the proportion of body cameras being turned on during police shootings.
Alternative: There is a significant difference between US states in the proportion of body cameras being turned on during police shootings.
Significance level: α = 0.05
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.106 0.132 0.143 0.180 0.429
##
## Pearson's Chi-squared test
##
## data: contable
## X-squared = 3e+05, df = 2250, p-value <2e-16
With a p-value below 2e-16, far smaller than our significance level of α = 0.05, we reject the null hypothesis and conclude that there are significant differences between states’ proportions of body camera usage during fatal police shootings.
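For reference, one common way to set up a chi-square test of this kind is on a state-by-camera contingency table built from the individual observations. The sketch below uses simulated data; the state list, the per-state probabilities, and the table layout are assumptions and may differ from the `contable` object used above.

```r
set.seed(42)

states <- c("CA", "TX", "FL", "WA")

# Assumed per-state probabilities that the body camera was on.
p_on <- c(CA = 0.25, TX = 0.10, FL = 0.12, WA = 0.15)

# Simulate 2000 shootings: a state and a TRUE/FALSE body camera flag for each.
state       <- sample(states, 2000, replace = TRUE)
body_camera <- runif(2000) < unname(p_on[state])

# Rows are states, columns are camera off/on counts.
contable <- table(state, body_camera)

# Pearson chi-square test of independence between state and camera status.
test <- chisq.test(contable)

test$parameter  # degrees of freedom: (4 states - 1) * (2 outcomes - 1) = 3
test$p.value
```

In this layout the degrees of freedom equal (number of states - 1) x (2 - 1), which gives a quick sanity check that the table has the intended shape.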
This exploratory data analysis has shown that there is a significant difference in the level of body camera usage in police shootings between states and regions in the United States. We intend to delve into the reasons for these differences and research what factors may explain them. This will require understanding state laws and policies regarding the use of police body cameras. We must also understand the consequences officers face for turning off body cameras during police activity in different states.
Studying the use of body cameras in police work is an important topic of study for data-driven policy research in the United States. We hope to be able to apply this correlation between the U.S. state of observation and whether the body camera was on or off during the shooting to state policy on body cameras during police work.
Because of our findings in Part 4, we know there are significant differences in the level of body camera usage in police shootings between US states, but let us see if we can find out what drives those differences.
Our second SMART question:
For the years 2021 and 2022, do the following have any influence on a state’s proportion of body camera usage?
- US region
- Law enforcement officers per capita
- Law enforcement spending per capita
- Body camera mandate laws
- 2020 presidential election leaning
We will use multiple linear regression to build various models to see if any of these variables can be useful predictors.
Note: Because most states that have body camera laws had them take effect at the start of 2021, we will only look at data from 2021 and 2022. This reduces the number of cases in our original data set to 1,763. (This is below the 4,000-observation threshold, but was approved in class by Professor Faruque.)
First let us take a look at the new dataset with its new variables (added in Part 3):
## # A tibble: 6 × 17
## # Groups: state [6]
## state month year regions spendpc bclaw marg2020 le_per_1…¹ stbcp gen.p smi.p
## <chr> <chr> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 WA 10 2022 NW 608 0 19 320. 0.112 0.958 0.336
## 2 OR 10 2022 NW 736 0 16 284. 0.0833 0.979 0.302
## 3 KS 10 2022 MW 553 0 -15 467. 0.133 0.917 0.217
## 4 CA 10 2022 NW 981 0 29 378. 0.184 0.946 0.232
## 5 CO 10 2022 NW 664 0 14 417. 0.124 0.965 0.139
## 6 OK 10 2022 SW 487 0 -33 410. 0.180 0.976 0.216
## # … with 6 more variables: flee.p <dbl>, att.p <dbl>, armed.p <dbl>,
## # MoD.p <dbl>, age.avg <dbl>, Non_White_prop <dbl>, and abbreviated variable
## # name ¹le_per_100k
## [1] 1763
Now for some light EDA to look at the new data:
Histogram on 2020 Election Swing Margin
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Histogram on police spending per capita (for the year 2021)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Histogram on Law Enforcement Officers per 100K citizens
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The following 3D plot compares the variables bclaw, stbcp, and marg2020:
Here are some multivariate plots to visualize how the new variables relate:
Law Enforcement Officers Per 100K citizens vs 2020 Election Margin
State Body Camera Proportion vs Law Enforcement Officers Per 100K citizens
State Body Camera Proportion vs 2020 Election Margin
State Body Camera Proportion vs US Regions
## Warning: Using size for a discrete variable is not advised.
US Regions vs Law Enforcement Officers Per 100K citizens
## Warning: Using size for a discrete variable is not advised.
The following 3D plot compares body camera proportion, 2020 election margin, and the state’s region:
And finally, here is a 3D plot of law enforcement officers per 100K citizens, police spending, and the state’s region:
Now that we are familiar with the data, we can start to model with our new state-wide data.
This is model 1: a simple multiple linear regression (MLR) model that uses all the new variables along with the region variable:
##
## Call:
## lm(formula = stbcp ~ (marg2020 + bclaw + regions + le_per_100k +
## spendpc), data = FD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.14481 -0.02162 -0.00773 0.00831 0.29707
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.09e-01 1.13e-02 9.68 < 2e-16 ***
## marg2020 1.77e-04 1.22e-04 1.45 0.14625
## bclaw 6.37e-02 1.06e-02 6.00 2.4e-09 ***
## regionsNE -3.52e-02 6.83e-03 -5.16 2.8e-07 ***
## regionsNW -2.00e-02 5.85e-03 -3.42 0.00064 ***
## regionsSE -2.86e-02 4.74e-03 -6.03 2.0e-09 ***
## regionsSW -4.44e-03 4.97e-03 -0.89 0.37116
## le_per_100k -7.15e-05 2.70e-05 -2.64 0.00825 **
## spendpc 1.27e-04 1.48e-05 8.58 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0579 on 1754 degrees of freedom
## Multiple R-squared: 0.183, Adjusted R-squared: 0.179
## F-statistic: 49.1 on 8 and 1754 DF, p-value: <2e-16
## GVIF Df GVIF^(1/(2*Df))
## marg2020 2.99 1 1.73
## bclaw 1.09 1 1.04
## regions 5.43 4 1.24
## le_per_100k 2.19 1 1.48
## spendpc 3.97 1 1.99
## res
## 1 -0.03482
## 2 -0.08165
## 3 -0.00990
## 4 -0.00773
## 5 -0.02162
## 6 0.04839
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The VIF values for model 1 are all within an acceptable range. With an R^2 of 0.183, the model explains only about 18% of the variation in statewide body camera usage, so it is not a strong predictor. We will next try a smaller model without the region variable.
Model 2 drops the region variable and keeps the remaining predictors from the previous model.
##
## Call:
## lm(formula = stbcp ~ (marg2020 + bclaw + spendpc + le_per_100k),
## data = FD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.14719 -0.02411 -0.00778 0.01628 0.27593
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.15e-01 1.10e-02 10.38 < 2e-16 ***
## marg2020 1.16e-04 1.12e-04 1.04 0.3
## bclaw 4.81e-02 1.05e-02 4.56 5.4e-06 ***
## spendpc 1.16e-04 1.18e-05 9.81 < 2e-16 ***
## le_per_100k -1.06e-04 1.88e-05 -5.65 1.9e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0589 on 1758 degrees of freedom
## Multiple R-squared: 0.152, Adjusted R-squared: 0.15
## F-statistic: 78.5 on 4 and 1758 DF, p-value: <2e-16
## marg2020 bclaw spendpc le_per_100k
## 2.42 1.03 2.45 1.02
## res
## 1 -0.04151
## 2 -0.08839
## 3 0.00578
## 4 -0.00778
## 5 -0.02468
## 6 0.05588
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The VIF values for model 2 are all within an acceptable range. With an R^2 of 0.152, this model is even worse than model 1 at predicting statewide body camera usage.
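To make the VIF check concrete, the sketch below fits a model with the same formula as model 2 on simulated data and computes each variance inflation factor by hand as 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The data are invented, and the hand-rolled calculation is an assumption standing in for whatever VIF function produced the output above.

```r
set.seed(7)
n <- 200

# Simulated stand-in for the FD data frame; all values are made up.
FD <- data.frame(
  marg2020    = rnorm(n, 0, 20),
  bclaw       = rbinom(n, 1, 0.1),
  le_per_100k = rnorm(n, 440, 80)
)
# Make spendpc partly depend on marg2020 so that some collinearity exists.
FD$spendpc <- 600 + 5 * FD$marg2020 + rnorm(n, 0, 100)
FD$stbcp   <- 0.1 + 0.05 * FD$bclaw + 1e-4 * FD$spendpc + rnorm(n, 0, 0.05)

# Same formula as model 2.
lm2 <- lm(stbcp ~ marg2020 + bclaw + spendpc + le_per_100k, data = FD)

# VIF for each predictor: regress it on the remaining predictors.
predictors <- c("marg2020", "bclaw", "spendpc", "le_per_100k")
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = FD))$r.squared
  1 / (1 - r2)
})

round(vif, 2)
summary(lm2)$r.squared
```

A VIF of 1 means a predictor is uncorrelated with the others; common rules of thumb treat values above roughly 5 to 10 as a sign of problematic multicollinearity.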
Let us try analyzing the interaction of law enforcement spending per capita and officers per capita.
Model 3 is just Model 1 again with the aforementioned interaction.
##
## Call:
## lm(formula = stbcp ~ (marg2020 + bclaw + regions + le_per_100k +
## spendpc + I(spendpc * le_per_100k)), data = FD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.13141 -0.01977 0.00129 0.00907 0.27827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.19e-01 2.93e-02 17.72 < 2e-16 ***
## marg2020 6.30e-04 1.19e-04 5.30 1.3e-07 ***
## bclaw 4.97e-02 1.00e-02 4.96 7.8e-07 ***
## regionsNE -4.50e-02 6.46e-03 -6.96 4.8e-12 ***
## regionsNW -2.33e-03 5.63e-03 -0.41 0.67844
## regionsSE -1.63e-02 4.53e-03 -3.59 0.00034 ***
## regionsSW 9.48e-03 4.77e-03 1.99 0.04690 *
## le_per_100k -9.59e-04 6.43e-05 -14.90 < 2e-16 ***
## spendpc -4.55e-04 4.12e-05 -11.05 < 2e-16 ***
## I(spendpc * le_per_100k) 1.22e-06 8.13e-08 15.01 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0545 on 1753 degrees of freedom
## Multiple R-squared: 0.276, Adjusted R-squared: 0.272
## F-statistic: 74.3 on 9 and 1753 DF, p-value: <2e-16
## res
## 1 -0.07073
## 2 -0.09132
## 3 0.00790
## 4 0.00515
## 5 -0.03706
## 6 0.04312
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We are skipping the VIF test for multicollinearity because we are using an interaction predictor, which is correlated with its component variables by construction. With an R^2 of 0.276, this model is still not good, but it is much better than the others at predicting statewide body camera usage.
Since lm3 is our best model (per our R^2), let’s try to predict a few made-up new US states:
Please note that Eleum and Faraam are identical except for their body camera laws, as are GW and HW.
Now let’s plug these new “states” into our model:
## fit lwr upr
## 1 0.204 0.183 0.226
## fit lwr upr
## 1 0.199 0.186 0.213
## fit lwr upr
## 1 0.212 0.189 0.234
## fit lwr upr
## 1 0.271 0.243 0.299
## fit lwr upr
## 1 0.127 0.117 0.137
## fit lwr upr
## 1 0.177 0.155 0.198
## fit lwr upr
## 1 0.129 0.121 0.138
## fit lwr upr
## 1 0.179 0.159 0.199
Comparing the fitted values for states E and F, and for G and H, shows the effect that body camera laws have: in each pair the predicted proportion differs by about 0.05.
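The comparison of two otherwise identical states can be sketched as follows. This is an illustration on simulated data, not the fitted lm3 from this report: the data frame, the two hypothetical rows `no_mandate` and `mandate`, and the choice of `interval = "confidence"` are all assumptions.

```r
set.seed(11)
n <- 300

# Simulated stand-in for the FD data frame; all values are made up.
FD <- data.frame(
  bclaw       = rbinom(n, 1, 0.2),
  spendpc     = runif(n, 400, 1300),
  le_per_100k = runif(n, 280, 720)
)
FD$stbcp <- 0.10 + 0.05 * FD$bclaw +
  1e-7 * FD$spendpc * FD$le_per_100k + rnorm(n, 0, 0.03)

# Main effects plus the spending-by-officers interaction, as in model 3
# (marg2020 and regions are left out to keep the sketch short).
fit <- lm(stbcp ~ bclaw + le_per_100k + spendpc + I(spendpc * le_per_100k),
          data = FD)

# Two hypothetical states that differ only in the body camera mandate.
new_states <- data.frame(
  bclaw       = c(0, 1),
  spendpc     = c(650, 650),
  le_per_100k = c(440, 440),
  row.names   = c("no_mandate", "mandate")
)

# Fitted values with lower and upper confidence bounds (columns fit, lwr, upr).
pred <- predict(fit, newdata = new_states, interval = "confidence")
pred

# Because everything else is held equal, the gap equals the bclaw coefficient.
law_effect <- pred[2, "fit"] - pred[1, "fit"]
law_effect
```

Since the model is linear in `bclaw`, the difference between the paired predictions is exactly the estimated `bclaw` coefficient, whatever values the other predictors take.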
Though lm3 is our best model, it still is not a great predictor of statewide body camera usage, which can lead us to the following conclusions: